Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: add return_inverse to pd.unique #24119

Closed
wants to merge 17 commits into from

Conversation

h-vetinari
Copy link
Contributor

This is the first part I'm splitting off of #24108, but now with full test coverage. For the moment, I've added return_inverse to pd.unique and to Categorical.unique, but it's not trivial because of inconsistencies like the following:

>>> import pandas as pd
>>> idx = pd.Index([0, 1, 1, 0])
>>> pd.unique(idx)
array([0, 1], dtype=int64)
>>>
>>> # So pd.unique(Index) yields an array, except if the Index is categorical...?
>>> idx = idx.astype('category')
>>> pd.unique(idx)
CategoricalIndex([0, 1], categories=[0, 1], ordered=False, dtype='category')

I'd be open to further split off the change for Categorical.unique, or just return NotImplemented for all ExtensionArray types. As mentioned in #24108 already, I believe that the possibility for return_inverse (or maybe even kwargs in general??) is something that should be added to the EA interface. @TomAugspurger @jreback @jbrockmendel

@pep8speaks
Copy link

pep8speaks commented Dec 5, 2018

Hello @h-vetinari! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-11 14:19:02 UTC

@h-vetinari
Copy link
Contributor Author

Failure is (thankfully) only a flaky hypothesis test.

@gfyoung gfyoung added Enhancement Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Dec 6, 2018
@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #24119 into master will decrease coverage by 0.17%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24119      +/-   ##
==========================================
- Coverage   92.37%    92.2%   -0.18%     
==========================================
  Files         166      162       -4     
  Lines       52420    51720     -700     
==========================================
- Hits        48423    47688     -735     
- Misses       3997     4032      +35
Flag Coverage Δ
#multiple 90.6% <91.66%> (-0.2%) ⬇️
#single 43.01% <25%> (+0.13%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 95.43% <100%> (-0.55%) ⬇️
pandas/core/algorithms.py 94.9% <87.5%> (+0.12%) ⬆️
pandas/io/s3.py 0% <0%> (-86.37%) ⬇️
pandas/io/sas/sasreader.py 86.2% <0%> (-9.95%) ⬇️
pandas/io/parquet.py 76.92% <0%> (-7.7%) ⬇️
pandas/io/clipboard/clipboards.py 28.23% <0%> (-2.36%) ⬇️
pandas/core/arrays/base.py 96.77% <0%> (-1.49%) ⬇️
pandas/core/computation/check.py 90.9% <0%> (-1.4%) ⬇️
pandas/core/arrays/datetimelike.py 96.35% <0%> (-1.33%) ⬇️
pandas/core/indexes/datetimelike.py 97.29% <0%> (-1.23%) ⬇️
... and 87 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb43726...006d7ad. Read the comment docs.

@codecov
Copy link

codecov bot commented Feb 2, 2019

Codecov Report

Merging #24119 into master will decrease coverage by 0.69%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #24119     +/-   ##
=========================================
- Coverage   93.07%   92.37%   -0.7%     
=========================================
  Files         192      166     -26     
  Lines       49551    52439   +2888     
=========================================
+ Hits        46119    48441   +2322     
- Misses       3432     3998    +566
Flag Coverage Δ
#multiple 90.79% <91.66%> (-1.04%) ⬇️
#single 42.86% <25%> (+0.35%) ⬆️
Impacted Files Coverage Δ
pandas/core/arrays/categorical.py 96% <100%> (-1.31%) ⬇️
pandas/core/algorithms.py 94.58% <87.5%> (-0.94%) ⬇️
pandas/io/gbq.py 25% <0%> (-75%) ⬇️
pandas/compat/__init__.py 57.91% <0%> (-36.96%) ⬇️
pandas/plotting/_misc.py 38.68% <0%> (-26.18%) ⬇️
pandas/io/common.py 72.86% <0%> (-21.25%) ⬇️
pandas/io/gcs.py 80% <0%> (-20%) ⬇️
pandas/io/s3.py 86.36% <0%> (-13.64%) ⬇️
pandas/io/formats/console.py 66.66% <0%> (-11.46%) ⬇️
pandas/core/computation/expr.py 88.68% <0%> (-8.84%) ⬇️
... and 198 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d8f9be7...786159f. Read the comment docs.

@h-vetinari
Copy link
Contributor Author

Thinks changed around a bit here with DatetimeArray etc., but this commit should work.

As mentioned in the OP, this splits off first chunk of #24108 and makes some progress towards #4087 / #21357 / #21720 / #22824. I'm sure there'll still be lots of discussion, but having an implementation is a good start (even though there's not much happening - the cython backend is already there since a few months).

The diff in test_algos.py is completely busted, but I painstakingly created some nicely modular commits to step through the changes in a sane fashion.

@h-vetinari
Copy link
Contributor Author

@jreback @jorisvandenbossche @TomAugspurger
Is it possible to give this PR some initial review? It's been lying around for ~3 months...

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Mar 10, 2019 via email

@jreback
Copy link
Contributor

jreback commented Mar 10, 2019

so again why should we add a method to do this, when we already have one?

In [16]: idx = pd.Index([0, 1, 1, 0])                                                                                                                                                                                                   

In [17]: pd.factorize(idx)                                                                                                                                                                                                              
Out[17]: (array([0, 1, 1, 0]), Int64Index([0, 1], dtype='int64'))

In [18]: idx.take(pd.factorize(idx)[0])                                                                                                                                                                                                 
Out[18]: Int64Index([0, 1, 1, 0], dtype='int64')

true numpy calls this return_inverse in .unique() but they also don't have .factorize(). I really don't like having multiple ways of doing things. I am not convinced this is actually that useful. Can you show the usecase that is not possible or (claimed) not performant that this cannot be done now?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments

@jreback
Copy link
Contributor

jreback commented May 12, 2019

closing as stale

@h-vetinari
Copy link
Contributor Author

@jreback, this is not stale, but I do see that I overlooked to answer your comment a bit further up.

I've made the case in #22824 several times, but among other things, factorize and unique do different things, namely in the treatment of missing values (and there are several cases where that difference can be crucial).

Then there's the fact that most people who are not knee-deep in stats or R won't know factorize, but unique is incredibly intuitive and matches with what numpy offers.

Please reopen this and lets have this discussion (bearing in mind that the actual goal would be #24108; this PR is just a stepping stone).

@h-vetinari
Copy link
Contributor Author

Thanks for reopening! Will try to merge master soon.

Copy link
Contributor Author

@h-vetinari h-vetinari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments on the diff

@@ -94,6 +94,25 @@ of the Series or columns of a DataFrame will also have string dtype.
We recommend explicitly using the ``string`` data type when working with strings.
See :ref:`text.types` for more.


.. _whatsnew_1000.enhancements.unique:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, this was chopped off of #24108 and the section is intended to be bigger, compare here

@@ -355,19 +371,23 @@ def test_factorize_na_sentinel(self, sort, na_sentinel, data, uniques):
else:
tm.assert_extension_array_equal(uniques, expected_uniques)


class TestUnique:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that no tests are removed here (even though the diff is large). I sometimes joined tests and parametrized them with fixtures. In fact, there should be many more tests now...

assert_series_or_index_or_array_or_categorical_equal(result, expected)

# TODO: add support for return_inverse to DatetimeArray/DatetimeIndex,
# as well as [[Series/Index].unique
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this here as an indication where things should be heading.

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jorisvandenbossche @jreback
This is updated and green. Since the diff in the tests basically unreadable, I'll recall the following:

The diff in test_algos.py is completely busted, but I painstakingly created some nicely modular commits to step through the changes in a sane fashion.

@jreback
Copy link
Contributor

jreback commented Oct 12, 2019

I am highly u likely to change my -1 on this

i view this as duplicative and confusing api

if you want to document in an example great

@h-vetinari
Copy link
Contributor Author

@jreback: i view this as duplicative and confusing api

.unique is central API, and extremely well-established (on top of being very intuitive; plus: the method exists in numpy with the inverse). You yourself suggested to add a way of doing return_inverse=True for unique.

You had also already agreed to the utility of having an inverse here (at the time you suggested to add the inverse to duplicated, which I did in #21645, only to be redirected to add the inverse to .unique where it fits more properly).

And as I won't tire of repeating: unique != factorize. The methods have different goals and differ in several key points.

I feel the conversation keeps going in circles - maybe this could be a nice example case for writing a fully fledged enhancement proposal?

@jreback
Copy link
Contributor

jreback commented Jan 26, 2020

closing. I don't think anyone has the bandwith to work with you on this.

@jreback jreback closed this Jan 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Enhancement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants